Initial Look at the Data

trips <- read_csv("daily_trips_and_weather.csv") %>%
  rename(trips = n) %>%
  mutate(month = month(date, label = T, abbr = T),
         day_nm = wday(date, label = T, abbr = T)) %>%
  clean_names() 

skimr::skim(trips)
Data summary
Name trips
Number of rows 1986
Number of columns 44
_______________________
Column type frequency:
character 3
Date 1
factor 2
logical 1
numeric 30
POSIXct 7
________________________
Group variables

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
summary 31 0.98 15 69 0 108 0
icon 22 0.99 3 17 0 6 0
precip_type 497 0.75 4 4 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-04-23 2020-09-30 2018-01-11 1986

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 TRUE 12 May: 186, Jul: 186, Aug: 186, Jun: 180
day_nm 0 1 TRUE 7 Mon: 284, Tue: 284, Wed: 284, Thu: 284

Variable type: logical

skim_variable n_missing complete_rate mean count
ozone 1986 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
trips 0 1.00 1.945880e+03 884.40 1.00 1.263000e+03 1.939000e+03 2.675750e+03 5.386000e+03 ▅▇▇▂▁
moon_phase 9 1.00 5.000000e-01 0.29 0.00 2.500000e-01 5.000000e-01 7.500000e-01 1.000000e+00 ▇▇▇▇▇
precip_intensity 9 1.00 0.000000e+00 0.01 0.00 0.000000e+00 0.000000e+00 0.000000e+00 9.000000e-02 ▇▁▁▁▁
precip_intensity_max 9 1.00 4.000000e-02 0.08 0.00 0.000000e+00 0.000000e+00 3.000000e-02 7.200000e-01 ▇▁▁▁▁
precip_probability 22 0.99 3.400000e-01 0.38 0.00 1.000000e-02 9.000000e-02 7.700000e-01 1.000000e+00 ▇▁▁▁▃
temperature_high 9 1.00 6.699000e+01 18.39 16.94 5.145000e+01 6.861000e+01 8.348000e+01 9.887000e+01 ▁▆▇▇▇
temperature_high_time 9 1.00 1.515856e+09 49544437.68 1429814460.00 1.473022e+09 1.515962e+09 1.558729e+09 1.601400e+09 ▇▇▇▇▇
temperature_low 10 0.99 5.099000e+01 16.42 4.82 3.749000e+01 5.201000e+01 6.557000e+01 8.196000e+01 ▁▆▇▇▆
temperature_low_time 10 0.99 1.515893e+09 49553485.02 1429871760.00 1.473039e+09 1.515970e+09 1.558793e+09 1.601464e+09 ▇▇▇▇▇
apparent_temperature_high 9 1.00 6.685000e+01 20.76 4.15 5.104000e+01 6.829000e+01 8.444000e+01 1.128000e+02 ▁▅▆▇▃
apparent_temperature_high_time 9 1.00 1.515856e+09 49544291.28 1429815660.00 1.473022e+09 1.515968e+09 1.558729e+09 1.601400e+09 ▇▇▇▇▇
apparent_temperature_low 10 0.99 5.057000e+01 18.26 -7.60 3.548000e+01 5.250000e+01 6.642000e+01 9.159000e+01 ▁▃▆▇▃
apparent_temperature_low_time 10 0.99 1.515892e+09 49553421.21 1429870140.00 1.473039e+09 1.515969e+09 1.558793e+09 1.601464e+09 ▇▇▇▇▇
dew_point 9 1.00 4.660000e+01 18.44 -11.08 3.236000e+01 4.867000e+01 6.279000e+01 7.679000e+01 ▁▃▆▇▇
humidity 9 1.00 6.700000e-01 0.14 0.26 5.700000e-01 6.700000e-01 7.700000e-01 9.700000e-01 ▁▃▇▇▃
pressure 9 1.00 1.017390e+03 7.19 992.00 1.012700e+03 1.017100e+03 1.022000e+03 1.041100e+03 ▁▃▇▃▁
wind_speed 9 1.00 4.230000e+00 2.43 0.57 2.410000e+00 3.720000e+00 5.400000e+00 1.759000e+01 ▇▅▂▁▁
wind_gust 9 1.00 1.377000e+01 6.29 1.51 9.330000e+00 1.245000e+01 1.691000e+01 5.166000e+01 ▇▇▂▁▁
wind_gust_time 9 1.00 1.515849e+09 49542960.28 1429814820.00 1.472998e+09 1.515910e+09 1.558670e+09 1.601431e+09 ▇▇▇▇▇
wind_bearing 9 1.00 2.036600e+02 96.21 0.00 1.140000e+02 2.250000e+02 2.870000e+02 3.590000e+02 ▃▅▅▇▇
cloud_cover 9 1.00 5.000000e-01 0.29 0.00 2.600000e-01 4.700000e-01 7.300000e-01 1.000000e+00 ▆▇▇▆▆
uv_index 9 1.00 5.000000e+00 2.41 1.00 3.000000e+00 5.000000e+00 7.000000e+00 1.100000e+01 ▇▆▆▃▁
uv_index_time 9 1.00 1.515847e+09 49544872.36 1429808220.00 1.473008e+09 1.515950e+09 1.558719e+09 1.601399e+09 ▇▇▇▇▇
visibility 9 1.00 9.250000e+00 1.16 1.83 9.070000e+00 9.770000e+00 9.990000e+00 1.000000e+01 ▁▁▁▁▇
temperature_min 9 1.00 5.006000e+01 16.64 4.82 3.608000e+01 5.080000e+01 6.511000e+01 8.196000e+01 ▁▆▇▇▆
temperature_max 9 1.00 6.737000e+01 18.11 16.94 5.226000e+01 6.890000e+01 8.348000e+01 9.887000e+01 ▁▆▇▇▇
apparent_temperature_min 9 1.00 4.948000e+01 18.60 -7.60 3.405000e+01 5.129000e+01 6.596000e+01 9.159000e+01 ▁▅▆▇▃
apparent_temperature_max 9 1.00 6.735000e+01 20.32 4.15 5.180000e+01 6.852000e+01 8.444000e+01 1.128000e+02 ▁▅▆▇▂
precip_accumulation 1870 0.06 3.000000e-01 0.74 0.00 1.000000e-02 6.000000e-02 1.600000e-01 4.460000e+00 ▇▁▁▁▁
row_num 0 1.00 1.000000e+00 0.00 1.00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
sunrise_time 9 1.00 2015-04-23 10:12:00 2020-09-29 10:57:00 2018-01-14 12:22:00 1977
sunset_time 9 1.00 2015-04-23 23:48:00 2020-09-29 22:47:00 2018-01-14 22:00:00 1977
precip_intensity_max_time 117 0.94 2015-04-23 22:00:00 2020-09-29 20:41:00 2018-03-04 19:01:00 1869
temperature_min_time 9 1.00 2015-04-23 10:17:00 2020-09-29 07:22:00 2018-01-14 12:27:00 1977
temperature_max_time 9 1.00 2015-04-23 18:41:00 2020-09-29 17:25:00 2018-01-14 20:38:00 1977
apparent_temperature_min_time 9 1.00 2015-04-24 02:10:00 2020-09-29 07:14:00 2018-01-14 12:52:00 1977
apparent_temperature_max_time 9 1.00 2015-04-23 19:01:00 2020-09-29 17:26:00 2018-01-14 22:13:00 1977

The bikeshare data comes from Inego and according to the documentation its already somewhat cleaned

  • Staff servicing and test trips are removed.
  • Trips below 1 minute are removed.
  • A “Virtual Station” listed in the checkout and return kiosks, is used by staff to check in or check out a bike remotely for a special event or in a situation in which a bike could not otherwise be checked in or out to a station.
  • Trip lengths are capped at 24 hours.
  • Some short round trips or long trips may be the result of system or user error, but have been kept in the dataset for completeness.

The weather data comes from DarkSky, and was obtained using the darksky package.

Missing data in the weather data set is an issue - ozone is 100% missing and precip_accumulation is almost entirely missing. Lets remove them as well as any timestamps since they probably won’t be useful in the model.

trips <- trips %>%
  select(-ends_with("time"), -c(row_num, ozone, precip_accumulation))

skimr::skim(trips)
Data summary
Name trips
Number of rows 1986
Number of columns 28
_______________________
Column type frequency:
character 3
Date 1
factor 2
numeric 22
________________________
Group variables

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
summary 31 0.98 15 69 0 108 0
icon 22 0.99 3 17 0 6 0
precip_type 497 0.75 4 4 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-04-23 2020-09-30 2018-01-11 1986

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 TRUE 12 May: 186, Jul: 186, Aug: 186, Jun: 180
day_nm 0 1 TRUE 7 Mon: 284, Tue: 284, Wed: 284, Thu: 284

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
trips 0 1.00 1945.88 884.40 1.00 1263.00 1939.00 2675.75 5386.00 ▅▇▇▂▁
moon_phase 9 1.00 0.50 0.29 0.00 0.25 0.50 0.75 1.00 ▇▇▇▇▇
precip_intensity 9 1.00 0.00 0.01 0.00 0.00 0.00 0.00 0.09 ▇▁▁▁▁
precip_intensity_max 9 1.00 0.04 0.08 0.00 0.00 0.00 0.03 0.72 ▇▁▁▁▁
precip_probability 22 0.99 0.34 0.38 0.00 0.01 0.09 0.77 1.00 ▇▁▁▁▃
temperature_high 9 1.00 66.99 18.39 16.94 51.45 68.61 83.48 98.87 ▁▆▇▇▇
temperature_low 10 0.99 50.99 16.42 4.82 37.49 52.01 65.57 81.96 ▁▆▇▇▆
apparent_temperature_high 9 1.00 66.85 20.76 4.15 51.04 68.29 84.44 112.80 ▁▅▆▇▃
apparent_temperature_low 10 0.99 50.57 18.26 -7.60 35.48 52.50 66.42 91.59 ▁▃▆▇▃
dew_point 9 1.00 46.60 18.44 -11.08 32.36 48.67 62.79 76.79 ▁▃▆▇▇
humidity 9 1.00 0.67 0.14 0.26 0.57 0.67 0.77 0.97 ▁▃▇▇▃
pressure 9 1.00 1017.39 7.19 992.00 1012.70 1017.10 1022.00 1041.10 ▁▃▇▃▁
wind_speed 9 1.00 4.23 2.43 0.57 2.41 3.72 5.40 17.59 ▇▅▂▁▁
wind_gust 9 1.00 13.77 6.29 1.51 9.33 12.45 16.91 51.66 ▇▇▂▁▁
wind_bearing 9 1.00 203.66 96.21 0.00 114.00 225.00 287.00 359.00 ▃▅▅▇▇
cloud_cover 9 1.00 0.50 0.29 0.00 0.26 0.47 0.73 1.00 ▆▇▇▆▆
uv_index 9 1.00 5.00 2.41 1.00 3.00 5.00 7.00 11.00 ▇▆▆▃▁
visibility 9 1.00 9.25 1.16 1.83 9.07 9.77 9.99 10.00 ▁▁▁▁▇
temperature_min 9 1.00 50.06 16.64 4.82 36.08 50.80 65.11 81.96 ▁▆▇▇▆
temperature_max 9 1.00 67.37 18.11 16.94 52.26 68.90 83.48 98.87 ▁▆▇▇▇
apparent_temperature_min 9 1.00 49.48 18.60 -7.60 34.05 51.29 65.96 91.59 ▁▅▆▇▃
apparent_temperature_max 9 1.00 67.35 20.32 4.15 51.80 68.52 84.44 112.80 ▁▅▆▇▂

precip_type seems to be NA when there is no precipitation, so we can recode that to none. The handfull of remaining missing values we can omit from the data

trips <- trips %>%
  mutate(precip_type = ifelse(is.na(precip_type), "none", precip_type)) %>%
  na.omit() 

skimr::skim(trips)
Data summary
Name trips
Number of rows 1942
Number of columns 28
_______________________
Column type frequency:
character 3
Date 1
factor 2
numeric 22
________________________
Group variables

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
summary 0 1 15 69 0 108 0
icon 0 1 3 17 0 6 0
precip_type 0 1 4 4 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-04-23 2020-09-29 2017-12-27 1942

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 TRUE 12 May: 186, Aug: 186, Jul: 181, Sep: 177
day_nm 0 1 TRUE 7 Sat: 280, Mon: 278, Sun: 277, Wed: 277

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
trips 0 1 1949.60 886.52 1.00 1263.00 1943.50 2682.00 5386.00 ▅▇▇▂▁
moon_phase 0 1 0.50 0.29 0.00 0.25 0.50 0.75 1.00 ▇▇▇▇▇
precip_intensity 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.09 ▇▁▁▁▁
precip_intensity_max 0 1 0.04 0.08 0.00 0.00 0.00 0.03 0.72 ▇▁▁▁▁
precip_probability 0 1 0.33 0.38 0.00 0.01 0.09 0.76 1.00 ▇▁▁▁▃
temperature_high 0 1 67.05 18.38 16.94 51.54 68.75 83.52 98.87 ▁▆▇▇▇
temperature_low 0 1 50.99 16.41 4.82 37.55 52.06 65.56 81.96 ▁▆▇▇▆
apparent_temperature_high 0 1 66.92 20.75 4.15 51.30 68.38 84.43 112.80 ▁▅▆▇▃
apparent_temperature_low 0 1 50.59 18.23 -7.60 35.62 52.55 66.39 91.59 ▁▃▆▇▃
dew_point 0 1 46.60 18.41 -11.08 32.44 48.73 62.72 76.79 ▁▃▆▇▇
humidity 0 1 0.67 0.14 0.26 0.57 0.67 0.77 0.97 ▁▅▇▇▃
pressure 0 1 1017.39 7.18 992.00 1012.70 1017.10 1021.98 1041.10 ▁▃▇▃▁
wind_speed 0 1 4.20 2.43 0.57 2.40 3.68 5.38 17.59 ▇▅▂▁▁
wind_gust 0 1 13.79 6.29 1.51 9.36 12.46 16.95 51.66 ▇▇▂▁▁
wind_bearing 0 1 203.62 96.27 0.00 114.00 225.00 287.75 358.00 ▃▅▅▇▇
cloud_cover 0 1 0.50 0.29 0.00 0.26 0.47 0.73 1.00 ▆▇▇▆▆
uv_index 0 1 5.01 2.41 1.00 3.00 5.00 7.00 11.00 ▇▆▆▃▁
visibility 0 1 9.26 1.15 1.83 9.09 9.77 9.99 10.00 ▁▁▁▁▇
temperature_min 0 1 50.05 16.63 4.82 36.12 50.80 65.08 81.96 ▁▆▇▇▆
temperature_max 0 1 67.42 18.11 16.94 52.33 68.97 83.52 98.87 ▁▅▇▇▇
apparent_temperature_min 0 1 49.48 18.58 -7.60 34.15 51.29 65.92 91.59 ▁▅▆▇▂
apparent_temperature_max 0 1 67.40 20.31 4.15 51.94 68.58 84.43 112.80 ▁▅▆▇▃

Fomatting Helper Function

We can put this all together and turn into a get_trips_data function. This ensures the data is consistent each time we load it, and can serve as a formatting template for the new weather data we will use in the model when we predict.

This function will live in a helper.R file to be used throughout the analysis.

get_trips_data <- function(data){
  
  trips <- data %>%
  rename(trips = n) %>%
  mutate(month = month(date, label = T, abbr = T),
         day_nm = wday(date, label = T, abbr = T)) %>%
  clean_names() %>%
  select(-ends_with(c("time", "min", "low")), starts_with("temperature"), -c(row_num, ozone, precip_accumulation, wind_bearing, wind_gust, moon_phase)) %>%
  mutate(precip_type = ifelse(is.na(precip_type), "none", precip_type)) %>%
  na.omit()

  return(trips)
  
}

Data Exploration

Target

Plotting the daily number of trips over time shows seasonality as well as 1 potential outlier - this could be a data issue.

ggplot(trips, aes(x = trips)) + geom_histogram()

ggplot(trips, aes(x = date, y = trips, color = month, group = 1)) + 
  geom_line()

It turns out this is when the pope was in Philadelphia, so it is a legitimate data point and we can leave it in!

pope visits philly

trips %>%
  filter(trips == max(trips)) %>%
  pull(date)
## [1] "2015-09-27"

Categorical Predictors

categorical_predictors <- trips %>% 
  select_if(., .predicate = is.character) %>%
  names() 

categorical_predictors <- c("trips", categorical_predictors, "day_nm", "month")

get_boxplot <- function(predictor){
  
  df <- trips[names(trips) %in% categorical_predictors]
  
  df <- df %>% 
    pivot_longer(cols = 2:ncol(.)) %>%
    filter(name == {{predictor}}) 
  
  title <- df$name
  
  df %>% 
    ggplot(., aes(x = forcats::fct_reorder(factor(value), trips, .fun = median), y = trips)) +
    geom_boxplot() +
    coord_flip() + 
    labs(x = title, y = "trips", title =  paste('Trips by', title))
}

# check relationship of categorical vars with target
map(categorical_predictors, get_boxplot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

Numeric Predictors

Based on these scatter plots, it makes sense to remove moon_phase, wind_bearing and wind_gust. Below the scatterplots is a correlation matrix of predictors. I am also going to remove the *_min and *_low variables as well since they are highly correlated with their *_max and *_high counterparts.

Both of these changes are implemented in the get_trips_data() function above.

# check relationship of numeric vars with target

numeric_predictors <- trips %>% 
  select_if(., .predicate = is.numeric) %>%
  names() 

numeric_predictors <- c("trips", numeric_predictors)

get_scatterplot <- function(predictor){
  
  df <- trips[names(trips) %in% numeric_predictors]
  
  df <- df %>% 
    pivot_longer(cols = 2:ncol(.)) %>%
    filter(name == {{predictor}}) 
  
  title <- df$name
  
  df %>%
    ggplot(., aes(x = value, y = trips)) + 
    geom_point() + 
    geom_smooth() + 
    labs(x = title, y = "trips", title =  paste('Trips vs', title))
}

map(numeric_predictors, get_scatterplot)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

## 
## [[9]]

## 
## [[10]]

## 
## [[11]]

## 
## [[12]]

## 
## [[13]]

## 
## [[14]]

## 
## [[15]]

## 
## [[16]]

## 
## [[17]]

## 
## [[18]]

## 
## [[19]]

## 
## [[20]]

## 
## [[21]]

## 
## [[22]]

## 
## [[23]]

cor_mat <- trips %>% 
  select_if(., is.numeric) %>%
  select(-trips) %>%
  cor()

heatmaply_cor(cor_mat, 
              symm = TRUE, 
              cexRow = .0001, 
              cexCol = .0001, 
              branches_lwd = .1
)